Introduction to Open Data Science - Course Project

Chapter 1

Thoughts about this course:

I enjoyed the soft landing on this course. It was comforting to start and go through the very basics.
From this course, I expect to learn to integrate the use of github and R markdown in my daily work. I also expect to learn the best practices in open data science. I heard about this course through
the email list of my doctoral program.

Reflections on the learning experience:

So far the book and exercises have been a great crash course in R tools and RStudio. I enjoy the writing style of the book. My favorite section so far is Chapter 4, which focuses on plots; I found it very useful.

My GitHub repository: https://github.com/kattilakoski/IODS-project


Assignment 2: Analysis

The dataset used in this assignment is from a survey of the learning approaches and achievements of students on an introductory statistics course. In this assignment we use a subset of the results, including information on the gender, age, attitude, total points and learning approaches (deep, surface and strategic) of 166 students.

Reading data in and graphical overview of the data

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(tidyverse)

#Read the data in:
learning2014 <- read_csv("./data/learning2014.csv")
#Explore dimensions:
dim(learning2014)
## [1] 166   7
#and structure
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

In the first scatter plot matrix each variable is plotted against every other variable. The variable names run along the diagonal of the plot. In the plots on the same row as a variable's text box, that variable is on the y-axis; in the plots in the same column, it is on the x-axis.

Based on this scatter plot matrix there are no strong, clearly visible correlations between the variables. However, some pairs show an upward or downward trend, such as attitude and points, and the deep versus surface approaches.

Graphical overview of the data.

In the second plot matrix, created with ggpairs(), the variable names appear along the top and left side. The upper right part of the matrix shows the correlations between the continuous variables (age, attitude, deep, strategic, surface, points), the lower left part shows their scatter plots, the diagonal shows their density plots, and the leftmost column shows the histograms and box plots for the combinations of the categorical variable (gender) with the continuous variables.

From the matrix we can see that the majority of the respondents were female and around 20 years old. Age did not correlate significantly with any other variable. Attitude had a significant positive correlation with points and a significant negative correlation with the surface approach. The deep approach had a significant negative correlation with the surface approach, and the strategic approach also correlated negatively with the surface approach. This matrix confirms the trends that were visible in the previous scatter plot matrix.
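The kind of pairwise trend described here can also be computed directly with cor(); a minimal sketch using the built-in mtcars data as a stand-in, since learning2014 is not bundled with R:

```r
# Pearson correlation between two variables; a clearly negative value
# indicates the kind of downward trend discussed above.
# mtcars stands in for learning2014 here.
trend <- cor(mtcars$wt, mtcars$mpg)
round(trend, 2)  # about -0.87
```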

Regression model

# create a regression model with three explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# create a regression model without the variables that did not have statistically significant relationship with points
my_model3 <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Based on the t-tests used in the linear regression, the only significant relationship is a positive one between attitude and points (p < 0.001). For each coefficient, the t-test assesses whether the estimate differs significantly from zero, i.e. whether that explanatory variable has a statistically significant relationship with the target variable. As both the strategic and surface variables had p-values greater than 0.1, they did not have a statistically significant relationship with the points variable.

The multiple R-squared for the first regression model, with all three explanatory variables, is 0.2074. This means that only about 21% of the variability in points is explained by these explanatory variables. Even though the relationship between attitude and points is significant, the R-squared of the model containing only attitude is slightly smaller at 0.1906, meaning that attitude alone explains about 19% of the variability in points and the two dropped variables added very little explanatory power.
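As a sanity check on this interpretation, the "Multiple R-squared" reported by summary() is simply 1 - SS_res / SS_tot; a sketch on the built-in mtcars data (learning2014 is not bundled with R):

```r
# R-squared computed by hand, matching summary()'s "Multiple R-squared".
# mtcars stands in for learning2014 here.
fit <- lm(mpg ~ wt + hp, data = mtcars)
ss_res <- sum(residuals(fit)^2)                    # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared)  # TRUE
```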

Diagnostic plots

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5 for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(my_model2, c(1,2,5))

#same for the model with only attitude as the explanatory variable
par(mfrow = c(2,2))

plot(my_model3, c(1,2,5))

Diagnostic plots for the regression model with attitude as an explanatory variable.

There is no distinctive pattern in the Residuals vs Fitted plots in either case, suggesting that there is no non-linear relationship between the explanatory variables and the target variable.

The normal Q-Q plots show fairly straight lines following the dotted reference line, with deviations at both ends. These deviations may indicate that the residuals do not follow a normal distribution perfectly.

In the Residuals vs Leverage plots, dotted lines would appear in the upper and lower right corners if some observations had high Cook's distance scores, meaning they would be influential to the regression results and would not follow the trend of the other cases. In these plots none of the cases seem to be very influential.
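The Cook's distance reading from the plots can be double-checked numerically with cooks.distance(); a minimal sketch on the built-in mtcars data (the 4/n cutoff is just one common rule of thumb, not part of the course material):

```r
# Numeric counterpart to the Residuals vs Leverage plot:
# flag observations whose Cook's distance exceeds 4/n.
# mtcars stands in for learning2014 here.
fit <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / nrow(mtcars)]
flagged  # candidate influential observations, if any
```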


Assignment 3 analysis

Reading in and description of the data

#install packages:
# install.packages("boot")
# install.packages("readr")

# access libraries
library(dplyr); library(ggplot2); library(readr); library(tidyr); library(boot)
## Warning: package 'boot' was built under R version 4.2.3
#Read the joined student alcohol consumption data into R
alc <- read_csv("./data/student_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print out the names of the variables in the data 
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This dataset includes data on student achievement in two Portuguese schools. It has 370 observations and 35 variables, including for example student grades, sex and age. The dataset was formed by combining two datasets, one on student performance in mathematics and the other in Portuguese language.

Choosing variables for this analysis

For this analysis I chose to focus on age, mother's education, study time and absences. My hypotheses are that younger students drink less than older ones, that the higher the education level of the mother the lower the student's alcohol consumption, that students who study more drink less than students who study less, and that the more absences a student has the more they drink.

Exploring the distributions of the variables and their relationship with alcohol use

#distribution of alcohol consumption and age
d1 <- ggplot(alc, aes(x = high_use, y = age)) + geom_boxplot() + ggtitle("Age and alcohol consumption")
d1

#distribution of alcohol consumption and mother's education
d2 <- ggplot(alc, aes(x = high_use, y = Medu)) + geom_boxplot() + ggtitle("Mother's education and student alcohol consumption")
d2

#distribution of alcohol consumption and study time
d3 <- ggplot(alc, aes(x = high_use, y = studytime)) + geom_boxplot() + ggtitle("Studytime and alcohol consumption")
d3

#distribution of alcohol consumption and absences
d4 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Absences and alcohol consumption")
d4

#Summary statistics
#age
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), mean(age))
## # A tibble: 2 × 3
##   high_use count `mean(age)`
##   <lgl>    <int>       <dbl>
## 1 FALSE      259        16.5
## 2 TRUE       111        16.8
#mother's education
alc %>% group_by(high_use) %>% summarise(count = n(), mean(Medu))
## # A tibble: 2 × 3
##   high_use count `mean(Medu)`
##   <lgl>    <int>        <dbl>
## 1 FALSE      259         2.80
## 2 TRUE       111         2.79
#study time
alc %>% group_by(high_use) %>% summarise(count = n(), mean(studytime))
## # A tibble: 2 × 3
##   high_use count `mean(studytime)`
##   <lgl>    <int>             <dbl>
## 1 FALSE      259              2.16
## 2 TRUE       111              1.77
#absences
alc %>% group_by(high_use) %>% summarise(count = n(), mean(absences))
## # A tibble: 2 × 3
##   high_use count `mean(absences)`
##   <lgl>    <int>            <dbl>
## 1 FALSE      259             3.71
## 2 TRUE       111             6.38

Based on the distribution plots it looks like my first hypothesis is correct: younger students drink less than older students. However, in the numerical summary the difference in mean age is only 0.27 years, which is small. Against my hypothesis, there seems to be no difference in the alcohol use of students whose mothers have different education levels; this is also visible in the summary, as the difference in the mothers' mean education level is only 0.01. The distributions suggest my hypothesis on study time was right, as the students who drink less seem to study more, and the summary backs this up, although the difference in mean study time between high- and low-use students is only 0.40 on the study time scale. My hypothesis about absences also seems right, as students who drink more have more absences. This difference is the clearest: students with high alcohol use have on average 2.67 more absences than students with low alcohol use.

Logistic regression, odd ratios and coefficients

# find the model with glm()
m <- glm(high_use ~ age + Medu + studytime + absences, data = alc, family = "binomial")


# print out a summary of the fitted model
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + Medu + studytime + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2166  -0.8430  -0.6547   1.1573   2.2772  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.798207   1.786727  -1.566 0.117323    
## age          0.164398   0.103533   1.588 0.112316    
## Medu        -0.003725   0.111021  -0.034 0.973232    
## studytime   -0.579521   0.158539  -3.655 0.000257 ***
## absences     0.075027   0.023247   3.227 0.001249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 417.24  on 365  degrees of freedom
## AIC: 427.24
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept)          age         Medu    studytime     absences 
## -2.798207282  0.164397602 -0.003725333 -0.579521121  0.075027298
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %   97.5 %
## (Intercept) 0.06091918 0.001766207 1.979230
## age         1.17868287 0.963461160 1.447356
## Medu        0.99628160 0.801834321 1.240253
## studytime   0.56016655 0.406319506 0.757511
## absences    1.07791358 1.032143878 1.130799

Based on the negative coefficient estimates in the summary of the fitted model, higher values of mother's education and study time are associated with a lower likelihood of the high_use variable being TRUE, and the opposite holds for the age and absences variables. However, only the coefficients for study time and absences are statistically significant. From the model summary one can therefore say that the more time a student spends studying, the lower the likelihood of high alcohol use, and the more absence days a student has, the higher the likelihood of high alcohol use. Based on these results my hypotheses about mother's education and age were wrong, and the hypotheses on study time and absences were right.

Based on the odds ratios, for a one-year increase in age the odds of a student having high alcohol use increase by a factor of 1.18, when the other variables are held constant. The same interpretation applies to the other variables: for each additional level of mother's education the odds change by a factor of 1.00 (i.e. essentially no change), for each step up in study time the odds decrease by a factor of 0.56, and for each additional absence the odds increase by a factor of 1.08. Based on these point estimates, the biggest changes in odds are associated with age (the higher the age, the higher the alcohol use) and study time (the more study time, the lower the alcohol use).

The 95% confidence intervals of the odds ratios for the intercept, age, Medu, studytime and absences are 0.00-1.98, 0.96-1.45, 0.80-1.24, 0.41-0.76 and 1.03-1.13 respectively. The interpretation is that we are 95% confident that the true odds ratio of each variable lies within its interval. Because the intervals for age and Medu include 1, we cannot rule out that these variables have no effect on alcohol use. The intervals for study time and absences are narrower and exclude 1, so we have better evidence of their effect on the target variable. Based on these results my hypotheses about mother's education and age were not supported, while the hypotheses on study time and absences were.

Exploring the predictive power of the model

# find the model with glm() for only the statistically significant study time and absences
m <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, studytime, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 5
##    studytime absences high_use probability prediction
##        <dbl>    <dbl> <lgl>          <dbl> <lgl>     
##  1         2        3 FALSE          0.266 FALSE     
##  2         1        0 FALSE          0.336 FALSE     
##  3         1        7 TRUE           0.468 FALSE     
##  4         3        1 FALSE          0.149 FALSE     
##  5         1        6 FALSE          0.448 FALSE     
##  6         3        2 FALSE          0.159 FALSE     
##  7         2        2 FALSE          0.251 FALSE     
##  8         2        3 FALSE          0.266 FALSE     
##  9         1        4 TRUE           0.410 FALSE     
## 10         1        2 TRUE           0.372 FALSE
#2x2 tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     93   18
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2810811

Based on the cross tabulation (confusion matrix), the model predicts fairly well the cases where alcohol use is not high: only 11 false positives out of the 259 students who reported low alcohol use. The model does not predict the high-use cases well, however: out of the 111 high-use cases it got only 18 right, with 93 false negatives.
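These rates can be derived directly from the 2x2 table; the counts below are copied from the output printed above:

```r
# Counts from the confusion matrix above
tn <- 248  # low use, predicted low
fp <- 11   # low use, predicted high
fn <- 93   # high use, predicted low
tp <- 18   # high use, predicted high

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # overall correct rate
sensitivity <- tp / (tp + fn)                   # how well high use is caught
specificity <- tn / (tn + fp)                   # how well low use is caught
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.719, sensitivity 0.162, specificity 0.958
```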

The model has a prediction error of 0.28, which means it makes a wrong prediction 28% of the time. This is a rather high prediction error, and I would conclude that the model does not predict very well and is therefore not very powerful.

I think the “error rate of my guesses” and that of the model were pretty similar. I would say that the model is slightly better for prediction than guessing, but it has much room for improvement, for example by including other variables.
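One standard way to check whether the model generalizes beyond the training data would be 10-fold cross-validation with boot::cv.glm (the boot package was loaded at the start of this assignment); a sketch using the built-in mtcars data as a stand-in for alc:

```r
library(boot)

# Same loss function idea as above: proportion of wrong classifications
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# A small binomial model on mtcars (am = transmission type)
cv_fit <- glm(am ~ wt, data = mtcars, family = "binomial")

set.seed(123)  # the K-fold assignment is random
cv <- cv.glm(data = mtcars, cost = loss_func, glmfit = cv_fit, K = 10)
cv$delta[1]  # cross-validated proportion of wrong predictions
```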


Assignment 4 analysis exercises

Reading in and description of the data

# Install required packages:
#install.packages(c("MASS", "corrplot"))
#install.packages("plotly")
# Access required packages
library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
library(ggplot2)
library(dplyr)
library(GGally)

# Load the Boston data from the MASS package
data("Boston")

# Explore the structure and the dimensions of the data 
glimpse(Boston) # 506 rows and 14 columns
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset used in this assignment, and the explanations of the variable names, can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html The dataset has 506 rows and 14 columns, and it includes information on housing values in the suburbs of Boston. The columns include for example the per capita crime rate by town (crim) and the average number of rooms per dwelling (rm).

# Show a graphical overview of the data 
pairs(Boston)

p <- ggpairs(Boston)
# draw the plot
p

# Show summaries of the variables in the data.
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Based on the plot matrix, most variables seem to have statistically significant correlations with each other, with the exception of chas (the Charles River dummy variable, = 1 if the tract bounds the river and 0 otherwise), which either does not correlate at all or correlates more weakly than the other variables; chas only takes the values 0 and 1. From the density plots we can see, for example, that crim, zn, chas, dis, rad, tax and lstat are concentrated at low values (although some have a second small peak), only rm looks normally distributed, indus has two roughly even peaks, and age, ptratio and black have mostly high values.

The same trends show in the corrplot; however, the strength (intensity of color) and sign (blue for positive, red for negative) of each correlation are perhaps easier to see there. The strongest negative correlations (as one variable gets higher, the other gets lower) seem to be between indus and dis, nox and dis, age and dis, and lstat and medv. The strongest positive correlation (both get higher together) seems to be between rad and tax.
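The "strongest pair" reading can be verified programmatically from the correlation matrix; a small sketch (MASS provides the Boston data):

```r
library(MASS)  # Boston data

cm <- cor(Boston)
diag(cm) <- 0  # ignore the trivial self-correlations

# locate the strongest positive correlation
idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
rownames(cm)[idx]   # rad and tax
round(max(cm), 2)   # 0.91
```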

# Standardize the dataset
boston_scaled <- scale(Boston)

# Print out summaries of the scaled data 
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime', using the quantiles as the break points in the categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

As a result of scaling, the variable values got smaller and the mean of each variable is now zero. This is because in scaling the column mean is subtracted from each column and the difference is divided by the column's standard deviation.
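This scaling, x_scaled = (x - mean(x)) / sd(x), can be verified on a tiny example:

```r
# scale() centers each column to mean 0 and divides by its standard
# deviation, which is why every scaled mean above is exactly zero.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)

all.equal(as.vector(scale(x)), manual)  # TRUE
mean(manual)                            # 0
```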

# Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2524752 0.2400990 0.2425743 
## 
## Group means:
##                  zn      indus         chas        nox          rm        age
## low       1.0425119 -0.9324407 -0.125147750 -0.9144006  0.43298890 -0.9436196
## med_low  -0.1377075 -0.2913322 -0.002135914 -0.5332699 -0.12603728 -0.2923672
## med_high -0.3970656  0.1419969  0.295912203  0.3470225  0.07909257  0.3517509
## high     -0.4872402  1.0171960 -0.111631099  1.0364452 -0.44725159  0.8238125
##                 dis        rad        tax     ptratio       black        lstat
## low       0.9362288 -0.6856311 -0.7298304 -0.46856241  0.37794437 -0.770321569
## med_low   0.2901981 -0.5371217 -0.4556621 -0.02836836  0.32040425 -0.095144929
## med_high -0.3190184 -0.4005338 -0.3199151 -0.21803248  0.06419314  0.006500615
## high     -0.8689249  1.6373367  1.5134896  0.77985517 -0.80486847  0.904898627
##                  medv
## low       0.544248645
## med_low  -0.009323315
## med_high  0.125675806
## high     -0.725511224
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10940184  0.78998890 -0.80610291
## indus    0.09411342 -0.34700408  0.14517304
## chas    -0.09805521 -0.11561847  0.05120981
## nox      0.35189831 -0.58580477 -1.54803563
## rm      -0.07349454 -0.09743556 -0.11877500
## age      0.26321529 -0.28388078 -0.13913589
## dis     -0.06967575 -0.26039071 -0.04981722
## rad      3.22101965  0.80867750 -0.08880470
## tax     -0.10560117  0.18538385  0.69765678
## ptratio  0.11068735 -0.01683802 -0.24489293
## black   -0.13845163  0.01545825  0.16342682
## lstat    0.26020548 -0.19607187  0.38544957
## medv     0.19690181 -0.25493477 -0.26889697
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9463 0.0411 0.0126
#Draw the LDA (bi)plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

# Save the crime categories from the test set 
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Then predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results with the crime categories from the test set.
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        8      10        2    0
##   med_low    5      15        4    0
##   med_high   0       5       23    1
##   high       0       0        0   29
#Comment on the results. 

Based on the cross-tabulation, the LDA model worked best on predicting the high crime rates, as it predicted all 29 of them right. Of the low crime rates the model predicted 8 out of 20 right, of the med_low 15 out of 24, and of the med_high rates 23 out of 29 were predicted right by the LDA model.
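The per-class and overall accuracies can be computed directly from the cross-tabulation. In this sketch the counts are copied from the printed table above; note that a fresh run without a fixed seed would produce slightly different numbers.

```r
# Rebuild the confusion matrix from the printed cross-tabulation
tab <- matrix(c(8, 10,  2,  0,
                5, 15,  4,  0,
                0,  5, 23,  1,
                0,  0,  0, 29),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

diag(tab) / rowSums(tab)   # proportion correct per class
sum(diag(tab)) / sum(tab)  # overall accuracy: 75/102, about 0.74
```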

# Reload the Boston dataset 
data("Boston")

# Standardize the dataset to get comparable distances
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Run k-means algorithm on the dataset.
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

#Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') # here it seems 2 is an optimal number of clusters
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# ggpairs() needs the cluster as a data column mapped to an aesthetic, e.g.:
#p <- ggpairs(data.frame(boston_scaled, cluster = factor(km$cluster)), mapping = aes(col = cluster))
#p

In the plot of the total within-cluster sum of squares, the optimal number of clusters is where the line drops sharply and then levels off (the "elbow"). Here the sharp drop ends at around 2, so 2 is a reasonable choice for the number of clusters.
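Since `qplot()` is deprecated in recent ggplot2, the same elbow plot can be drawn with `ggplot()`. In the report `k_max` and `twcss` already exist; they are recomputed here only so the snippet runs standalone.

```r
library(ggplot2)

# Recompute the elbow curve (same as above) so this runs on its own
data("Boston", package = "MASS")
boston_scaled <- as.data.frame(scale(Boston))
k_max <- 10
set.seed(123)
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)

# The non-deprecated equivalent of the qplot() call
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line() +
  labs(x = "number of clusters", y = "total within-cluster sum of squares")
```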

# Standardize the dataset
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# add the cluster value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)

# Perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. 
# linear discriminant analysis
lda.fit <- lda(km.cluster ~ ., data = boston_scaled[,-4])   # lda() raised an error about variable 4 (chas) being constant within groups, so I removed it. I am not sure this is the right way to handle it, but the error is gone

# print the lda.fit object
lda.fit
## Call:
## lda(km.cluster ~ ., data = boston_scaled[, -4])
## 
## Prior probabilities of groups:
##          1          2          3 
## 0.06916996 0.61067194 0.32015810 
## 
## Group means:
##         crim         zn      indus        nox         rm        age        dis
## 1 -0.2048299 -0.1564737  0.2306535  0.3342374  0.3344149  0.3170678 -0.3634565
## 2 -0.3882449  0.2731699 -0.6264383 -0.5823006  0.2188304 -0.4585819  0.4807157
## 3  0.7847946 -0.4872402  1.1450405  1.0384727 -0.4896488  0.8062002 -0.8383961
##           rad        tax    ptratio      black      lstat       medv
## 1 -0.02700292 -0.1304164 -0.4453253  0.1787986 -0.1976385  0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183  0.3151747 -0.4640135  0.3182241
## 3  1.12436056  1.2034416  0.6329916 -0.6397959  0.9277624 -0.7457491
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.02479166 -0.13204141
## zn       0.42787622 -0.04638198
## indus    1.15646011  0.64348753
## nox      0.47943272  0.42987408
## rm       0.13637610 -0.12823804
## age     -0.06654278  0.30029385
## dis     -0.01915297  0.20848367
## rad      0.74637979  0.69845574
## tax      0.27967651 -1.02040695
## ptratio  0.19355485 -0.21964359
## black   -0.04753224  0.11581547
## lstat    0.47213016  0.02172623
## medv     0.06797263  0.92531426
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9822 0.0178
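The `chas` error above can be diagnosed before fitting: a variable that is (near-)constant within some group has a within-group standard deviation near zero. A sketch of such a check, recomputing the clustering here so the snippet runs standalone (in the report `boston_scaled` and `km` already exist):

```r
library(MASS)

# Recompute the scaled data and 3-cluster k-means solution
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled, centers = 3)

# Standard deviation of every variable within every cluster (14 x 3 matrix)
within_sd <- sapply(split(boston_scaled, km$cluster), function(g) sapply(g, sd))

# A minimum near 0 flags a variable that is almost constant in some cluster
round(apply(within_sd, 1, min), 3)
```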
# Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}


# plot the lda results 
plot(lda.fit, dimen = 2, col = boston_scaled$km.cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2)

# Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

Based on the biplot, it looks like the variables medv, rad, indus and tax are the most influential linear separators, which shows in the plot as longer arrows.
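The visual impression of "longer arrows" can be checked numerically: each arrow's length is the Euclidean norm of that variable's row in the matrix of LDA coefficients (the same `coef(x)` used in the `lda.arrows()` function above). The model is refitted here only so the snippet runs standalone.

```r
library(MASS)

# Refit the LDA on the k-means clusters, as in the report (chas dropped)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled, centers = 3)
boston_scaled$km.cluster <- km$cluster
lda.fit <- lda(km.cluster ~ ., data = boston_scaled[, -4])

# Arrow length per variable = Euclidean norm of its coefficient row
arrow_len <- sqrt(rowSums(coef(lda.fit)^2))
sort(arrow_len, decreasing = TRUE)  # longest arrows first
```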

#Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  2
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. 
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
#kmeans for train set
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
set.seed(123)
km_train <- kmeans(train[,-14], centers = 3)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_train$cluster)
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf

Yes, the plots differ in how the points are colored (four crime classes versus three k-means clusters), but the spatial locations of the points stay the same in both plots, since both use the same LDA projection of the train data.
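How well the two colorings agree can also be quantified by cross-tabulating the crime classes against the k-means clusters. This standalone sketch uses the full scaled Boston data instead of the train subset, so the exact counts differ from what the report's `train` set would give.

```r
library(MASS)

# Rebuild the quantile-based crime classes on the full scaled data
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Cluster the predictors (crim itself dropped) and cross-tabulate
set.seed(123)
km <- kmeans(boston_scaled[, -1], centers = 3)
table(crime, cluster = km$cluster)
```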